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Abstract 


Implementation  of  certain  algorithms  on  parallel  computing  architectures  can  in¬ 
volve  partitioning  contiguous  elements  into  a  fixed  number  of  groups,  each  of  which 
is  to  be  handled  by  a  single  processor.  It  is  desired  to  find  an  assignment  of  elements 
to  processors  that  minimizes  the  sum  of  the  maximum  workloads  experienced  at  each 
stage.  This  problem  can  bo  viewed  as  a  multi-objective  network  optimization  problem. 
Polynomially- bounded  algorithms  are  developed  for  the  case  of  two-stages,  whereas  the 
associated  decision  problem  (for  an  arbitrary  number  of  stages)  is  shown  to  be  NP- 
complete.  Heuristic  procedures  are  therefore  proposed  and  analyzed  for  the  general 
problem.  Computational  experience  with  one  of  the  exact  problems,  incorporating 
certain  pruning  rules,  is  presented  for  a  variety  of  test  problems.  Empirical  results  also  ✓ 
demonstrate  that  one  of  the  heuristic  procedures  is  especially  effective  in  practice.  ;  'v*  e  ) 
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1.  Introduction 

A  number  of  problems  in  image  processing,  scientific  computing,  and  numerical  optimi¬ 
zation  are  well  suited  for  solution  using  parallel  or  pipelined  computations.  Typically  there  are  far 
fewer  processors  than  elements  to  be  processed,  and  so  some  natural  aggregation  of  elements  is 
required  for  partitioning  the  workload  among  the  processors.  For  example,  a  spatial  aggregation 
of  elements  might  be  appropriate  for  image  processing,  whereas  an  aggregation  based  on  contig¬ 
uous  columns  might  be  appropriate  for  various  matrix  computations.  The  focus  of  the  present 
paper  is  on  this  latter  type  of  aggregation,  in  which  a  linear  ordering  of  elements  is  imposed  and  an 
optimal  partitioning  of  the  elements  is  required  that  achieves  a  balanced  workload  among  the 
processors. 

One  specific  motivating  example  arises  in  studying  parallel  implementations  of  the 

Cholesky  decomposition  of  a  banded  positive  definite  matrix  A,  as  proposed  by  O’Hallaron 

T 

(1988).  The  standard  Cholesky  decomposition  A  =  LL  can  be  computed  in  a  serial  manner  using 
two  alternating  stages.  The  first  stage  carries  out  an  appropriate  normalization  of  the  current 
column;  this  information  is  then  passed  on  to  other  columns  during  the  second  stage.  In 
O'Hallaron’s  pipelined  implementation,  a  set  of  contiguous  columns  (defined  by  the  matrix 
bandwidth)  is  allocated  to  a  smaller  number  of  processors.  Again,  the  Cholesky  scheme  consists 
of  alternating  stages  of  normalization  and  updating  sets  of  columns.  Because  of  the  need  for  global 
synchronization  at  the  end  of  each  stage,  the  computational  speed  is  limited  by  the  processor 
having  the  maximum  workload.  The  overall  speed  is  then  determined  by  the  sum  of  the  maximum 
V  '^rk  loads  at  each  stage,  and  an  allocation  of  columns  to  processors  that  minimizes  this  sum  is 
desired. 

Another  specific  motivating  example  occurs  in  the  parallel  solution  of  one-dimensional  fluid 
flow  problems  using  irregular  grid  hierarchies  (see  Berger  and  Oliger  1984).  Nonuniform 
gridding  is  used  to  concentrate  computational  effort  in  domain  regions  where  the  solution  changes 
rapidly  (as  from  a  shock  or  turbulence).  In  this  grid  hierarchy  various  uniformly  spaced  fine  grids 
which  do  not  span  the  entire  domain  are  superimposed  onto  a  uniformly  spaced  coarse  grid  which 
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does.  The  “stages”  of  such  problems  correspond  to  the  numerical  integration  of  grids  with  a 
common  spatial  separation;  inter  stage  synchronization  is  used  to  enforce  data  dependencies 
between  solutions  at  differing  grid  levels.  The  entire  computation  is  parallelized  by  dividing  the 
domain  (or  equivalently,  the  coarse  grid  points)  among  the  given  processors.  The  linear  ordering 
of  these  grid  points  reflects  the  natural  constraint  that  grid  point  updates  depend  on  contiguous  grid 
values.  As  in  the  Cholesky  decomposition,  the  processor  having  the  heaviest  workload  at  a  given 
stage  limits  the  progress  of  the  entire  system.  The  sum  of  the  maximum  workloads  at  each  stage 
again  measures  the  computation’s  execution  speed. 

The  following  section  provides  a  mathematical  formulation  of  the  general  problem  of 
minimizing  an  overall  objective  based  on  the  sum  of  maximum  workloads.  This  problem  is  termed 
the  Multistage  Linear  Array  Assignment  (MLAA)  problem.  It  is  shown  in  Section  5  that  the 
associated  decision  problem  is  NP-complete  for  a  general  number  (r)  of  stages.  Thus,  we  first 
concentrate  on  providing  efficient  polynomial  algorithms  for  the  MLAA  when  r  =  2.  Two  separate 
approaches  are  developed  (in  Sections  3  and  4)  for  the  two-stage  problem.  Section  6  discusses 
how  one  of  these  algorithms  can  be  generalized  in  a  natural  way  to  r  stages,  and  it  also  develops 
heuristic  procedures  for  an  arbitrary  number  of  stages.  Computational  experience  with  the  exact 
and  heuristic  algorithms  is  presented  in  Section  7. 

2,  Formulation 

It  is  supposed  that  n  elements  or  modules  are  given,  each  of  which  has  two  associated 
nonnegative  processing  times  x^,  y^  >  0  (i  =  1, ....  n).  There  are  also  available  p  processors, 
p  n.  each  of  which  is  capable  of  handling  an  arbitrary  number  of  modules.  We  consider  a 
partition  of  the  set  of  modules  {1,2. ...,  n}  intop  intervals  Ij,  L,  ...,  Ip  where  each  interval 
consists  of  consecutive  modules  (t,  t+  1, ...,  t-t-k},  k  >  0.  A  (quite  reasonable)  assumption  made 
here  is  that  the  processing  time  for  each  interval  is  additive  in  the  processing  times  of  its  component 
modules: 
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The  objective  here  is  to  select  intervals  I,, Ip  for  the  p  processors  to  minimize  z(I,, Ip), 
where 

z(Ip  ■  Ip)  =  [m^  x(ipi  +  [max  y(ipi  .  (1) 

The  extension  to  r  processing  times  (stages)  is  immediate. 

This  problem  can  be  formulated  on  an  acyclic  network  G  =  (N,  E),  where  the  node  set 
N  =  {0,  1 ,  . . n}  and  the  edge  set  E  consists  of  all  edges  (i,  j)  with  i  <  j.  Each  edge  (i,  j)  has  the 
associated  weights  x(i,  j)  and  y(i,  J),  defined  by 

=  X  ’‘k’  (2) 

i<k<j  i<k<j 

Notice  that  G  has  m  =  O(n^)  edges. 

It  is  easily  seen  that  our  two-stage  problem  is  precisely  that  of  finding  in  G  a  path  of  exactly 
p  edges  from  node  0  to  node  n  that  minimizes  the  sum  of  the  maximum  edge  weight  in  each 
component.  This  “min-sum-max”  problem  is  a  bicriterion  path  problem  that  has  not  been 
extensively  studied,  although  related  bicriterion  path  problems  have  been  investigated.  Warburion 
(1985)  considers  the  "min-max-sum”  problem  and  demonstrates  that  it  is  NP-hard  forr  =  2 
components.  The  problem  is  strongly  NP-hard  for  general  r.  A  number  of  authors  (Hansen  1980, 
I  lenig  1985)  have  studied  the  problem  of  generating  “efficient”  (Pareto  optimal)  paths  in  bicriterion 
nct\^orks,  using  a  variety  of  path  length  measures.  The  most  pertinent  measure  of  path  length  to 
our  present  study  is  the  maximum  edge  weight  for  each  criterion.  A  path  is  termed  efficient  when 
no  other  admissible  path  has  a  smaller  value  for  one  criterion  without  a  larger  value  for  the  other. 
Polynomial-time  algorithms  for  this  “minmax-minmax”  bicriterion  path  problem  have  been  given 
by  Hansen  (1980),  Berman  et  al.  (1987),  and  Warburton  (1987).  Other  related  work  is  presented 
by  Bokhari  11988),  relative  to  the  single-objective  problem  of  minimizing  the  maximum  of  the 
maximum  x-weight  and  the  additive  y-weight,  in  a  doubly  weighted  network.  While  the  iwoblem 
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studied  here  is  significantly  different  from  these  other  problems,  certain  common  solution 
strategies  exist  that  have  been  exploited  in  our  algorithmic  development  of  the  next  sections. 

Before  discussing  algorithms  for  the  two-stage  case,  it  is  worthwhile  to  point  out  certain 
difficulties  intrinsic  to  the  problem  of  minimizing  the  objective  function  (1).  Consider  the  two- 
stage  problem  with  processing  times  x-,  given  in  Table  1,  with  p  =  3.  The  (unique)  optimal 
partition  for  this  problem  has  I,  =  { 1 },  Ij  =  {2},  I3  =  {3,  4}.  However,  in  the  subproblem 
defined  over  the  modules  12^13  =  {2, 3,  4},  the  optimal  partition  for  p  =  2  processors  uses 
It  =  {2,  3}  and  13  =  {4}.  Thus  the  principle  of  optimality  required  for  a  dynamic  programming 
formulation  does  not  hold.  In  other  words,  optimal  solutions  cannot  be  found  simply  by  extending 
(in  the  best  possible  way)  the  known  optimal  solutions  for  smaller  subproblems. 

Table  1 

A  Three  Processor  Example  that  Violates  the  Optimality  Principle 


yi 

3.  The  Labeling  Algorithm 

In  this  section  a  labeling  algorithm  is  developed  to  solve  the  two-stage  MLAA  problem  (1). 
This  algorithm  works  directly  on  the  network  G  introduced  in  the  previous  section  and  exploits  the 
fact  that  G  is  acyclic.  Since  the  principle  of  optimality  cannot  be  invoked  for  this  problem  to 
extend  optimal  paths  for  a  subproblem,  we  maintain  at  each  node  j  e  N  several  sets  of  candidate 
paths.  That  is,  asso'^iated  with  each  node  j  will  be  certain  sets  L(j;k)  of  vectors  in  R  .  The  set 


Module  (1) 
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L(j;k)  =  {(x^  y^),  (x-,  y2),  (x‘,  y‘)}  corresponds  to  “path  length  vectors”  for  paths  P  from 

node  0  to  node  j  having  exactly  k  edges.  In  this  context,  the  first  component  x**  of  such  a  path 
vector  is  simply  the  maximum  x- weight  of  path  P,  and  the  second  component  y*  is  the  maximum 
y-weight  of  path  P. 

2 

To  express  these  notions  more  precisely,  let  u,  v  6  R  be  vectors  with  u  =  (Uj,  Uj)  and 
V  =:  (v.,  V.),  and  define  the  product  u  <Ei  v  by 

u  ®  V  =  (max(Uj,  Vj),  maxluj,  v^))  . 

In  this  notation,  the  path  length  vector  for  path  P  is  given  by 

len(P)  =  ®J~J  (x(i,j),  y(i,j)), 

(i.j)€P 

and  the  label  L(j;  k)  on  node  j  is  given  by 

L(j;k)  =  {len(P):  P  is  a  path  from  0  to  j  with  II  P||  =  k  } .  (3) 

Here  x(i,  j)  and  y(i,  j)  are  as  defined  in  (2),  and  ||  P||  is  used  to  denote  the  cardinality  (number  of 
edges)  of  path  P.  It  will  also  be  convenient  to  extend  the  definition  of  ®  to  sets  of  vectors:  if 
S  s  and  v  e  R"  then 

S  ®  V  =  {w  ®  v:  w  6  S}. 

Throughout,  we  will  maintain  sets  of  path  vectors  for  each  node.  At  the  end  of  the 
process,  an  optimal  path  vector  to  node  n  will  be  identified,  from  which  it  is  straightforward  to 
recover  an  associated  path  (via  a  simple  backtracking  scheme).  Consequently,  we  shall  work 
entirely  with  path  vectors.  Not  all  possible  path  vectors  need  to  be  retained  at  a  node,  however. 

To  this  end,  a  vector  x  is  said  to  dominate  vector  y  if  x  <  y  holds  componentwise  and  at  least  one 
of  the  inequalities  is  strict.  (This  nomenclature  is  meant  to  convey  the  idea  that  x  is  “better”  than  y; 
however  usage  in  this  way  is  not  standard  in  the  literature.)  A  set  S  of  vectors  is  efficient  (Pareto 


optimal)  if  no  vector  x  e  S  dominates  another  vector  y  e  S.  More  generally,  the  set 
EFFICIENT(S)  is  obtained  from  set  S  by  removing  all  dominated  vectors. 
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In  the  algorithm  given  below,  only  efficient  sets  L(j;k)  are  maintained  at  each  node.  This 
algonthm  represents  a  suitably  modified  version  of  the  algorithm  ACYCLIC  given  by  Warburton 
( 1 987)  for  obtaining  the  set  of  efficient  path  vectors  in  a  bicriterion  shortest  path  problem. 

LABELING  ALGORITHM 

1.  L(0;0)  =  {(0,0)}; 

L(0;  k)  =  0,  for  k  =  1 ,  . . .,  p. 

2.  For  j  =  1,  ....  n-1 

L(j;O)  =  0; 

For  k  =  1 . p  - 1 

L(j; k)  =  EFFICIENT  { U  L(i; k- 1 )®  (x(i,  j),  y(i,  j)) }  . 

'<J 

3.  L(n;p)  -  EFFICIENTILJ  L(i;p-1)®  (x(i,  n),  y(i,  n))}  . 

i<n 

4.  Select  v  =  (v,,  V2)  €  L(n;p)  for  which  Vj  +  Vj  is  minimum. 

Upon  termination,  the  labeling  algorithm  produces  a  set  of  efficient  path  vectors  L(n;p), 
from  which  an  optimal  p-edge  path  vector  for  (1)  is  readily  obtained  in  Step  4.  If  we  also  maintain 
the  sector  labels  L(n;k)  at  node  n,  sensitivity  analysis  information  will  be  available  on  the  effect  of 
reducing  the  number  of  processors  from  the  fixed  value  p.  Of  course,  optimal  partitions  for  the 
\  arious  subproblems  using  modules  {1, ...,  j},  j  <  n,  are  also  readily  derived  from  the  vector 
labels  on  node  j.  It  should  be  noted  that  the  range  for  k  in  Step  2  can  actually  be  replaced  by 
k  =  maxj  1,  p-n+j},  . ...  min{p- 1,  j}.  This  type  of  pruning  allows  th'*  elimination  of  paths 
whose  cardinality  is  too  small  to  lead  to  an  optimal  path  at  node  n  having  p  edges,  and  it  also 
avoids  the  examination  of  lists  known  to  be  empty. 

The  worst-case  time  complexity  of  the  labeling  algorithm  is  determined  by  the  complexity 
of  calculating  EFFICIENT(S)  in  Steps  2  and  3.  Suppose  that  is  the  maximum  size  of  any  set 
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L(j;k)  produced  during  the  course  of  the  algorithm.  By  keeping  each  L(j;k)  as  a  list  of  vectors, 
ordered  by  their  first  component,  repeated  merge  sorts  suffice  to  maintain  an  ordered  list  without 
dominated  elements.  In  the  worst  case  there  can  be  0(n)  sublists  to  merge,  each  of  size 
resulting  in  0(n  logn  work  to  calculate  EFFICIENT(S).  Overall,  the  complexity  of  carrying 
out  Steps  2  and  3  is  then  0(p  n^  log  n  Since  each  element  of  the  efficient  set  L(j;k)  repre¬ 

sents  a  path  vector  len(P)  and  since  the  first  component  of  len(P)  can  assume  at  most  m  =  |E| 
distinct  values,  <  m  =  0(n“)  and  the  overall  time  complexity  of  the  labeling  algorithm  can 
be  bounded  by  0(p  n'^  logn).  Space  requirements  are  0(p  n  worst  case.  In 

practice,  however,  these  estimates  would  appear  to  be  overly  pessimistic  and  the  expected  perfor¬ 
mance  of  the  algorithm  should  be  significantly  better.  The  empirical  results  presented  in  Section  7 
confirm  this  belief. 

4.  The  Recursive  Bisection  Algorithm 

Tliis  section  describes  a  different  approach  for  solving  the  two-stage  problem,  based  on 
generating  a  set  of  candidate  path  vectors  guaranteed  to  contain  all  efficient  path  vectors  from  node 
0  to  node  n.  The  optimal  solution  vector  can  then  be  easily  obtained  from  this  set.  The  following 
simple  observation  serves  to  motivate  this  second  approach.  Namely,  if  P*  is  an  optimal  path  for 
the  two-stage  problem,  then  the  optimal  objective  function  value  is  given  by 

z*  =  I  max  x(a)l  -i-  [max  y(b)]  =  x(a*)  +  y(b*) , 
a€  P*  be  P* 

where  a*  and  b*  arc  edges  of  P*.  Consider  the  subgraph  of  G 

Ga  =  {e  e  E;  x(e)  <  x(a)} 

consisting  of  all  edges  with  x-weight  at  most  x(a).  Then  it  is  not  difficult  to  find  a  “best”  path  in 
Gg  with  respect  to  the  y-weights,  a  path  P^  in  G^  from  0  to  n  with  cardinality  p  that  minimizes  the 
maximum  y-weight  along  P^: 

y(b*a)  =  max  {y(b)}  . 

b  e  Pj 
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Such  a  minimax  path  can  be  found  using  a  straightforward  modification  of  the  standard  Dijkstra 
shortest  path  algorithm  (see  Kalaba  1964).  Finally,  the  optimal  two-stage  solution  is  found  by 
selecting  a  path  that  minimizes  x(a)  +  y(bj).  This  basic  approach  would  require  m  calls  to  the 
minimax  (p-edge)  path  routine,  each  such  invocation  requiring  (Xp  m)  work,  so  the  overall  time 
complexity  is  bounded  by  0(p  m‘)  =  0(p  n^). 

An  improved  approach  that  treats  the  x-weights  and  y-weights  more  symmetrically,  and 
that  has  a  significantly  better  complexity  bound,  will  now  be  developed.  First,  the  x-weights  on 
the  m  arcs  of  G  are  sorted  into  nondecreasing  order:  x(aj)  <  x(a2)  <  ...  <  x(aj^).  Likewise,  the  y- 
weights  are  sorted  as  y(b[)  <  y(b2)  <  ...  <  y(b^).  Next,  we  define  the  subgraph  of  G  via 

Ga.b  =  {e  e  E:  x(e)  <  x(a),  y(e)  <  y(b)}. 

The  MLAA  problem  can  then  be  solved  by  finding  edges  a,  b  6  E  so  that  lx(a),  y(b)l  is  feasible 
(i.e.,  Gj^j,  admits  a  path  P  from  0  to  n  with  |lP||  =  p)  and  x(a)  +  y(b)  is  minimum  (over  all  feasible 
pairs). 

It  will  be  useful  to  introduce  a  feasibility  matrix  Z,  whose  (i,  j)  entry  indicates  whether 
the  pair  [x(aj),  y(bj)J  is  feasible  (F)  or  not  (I).  Because  of  the  previous  ordering  of  x-weights  and 
y-weights,  whenever  =  F  then  z-  =  F  for  i  >  s  and  j  >  t.  Also,  if  z^j  =  I  then  z-  =  I  for  i  <  s  and 
j  <  t.  .A  typical  pattern  for  the  matrix  Z  is  illustrated  in  Figure  1 .  The  efficient  path  vectors 
[x(a),  y(b)l  are  indicated  by  circled  entries  in  the  matrix. 

■  I  I  I  ©  F  F■ 

I  I  I  F  F  F 

I  I  ©  F  F  F 

^“©FFFFF 
F  F  F  F  F  F 

.  F  F  F  F  F  F  . 


Figure  1 .  Example  feasibility  matrix  Z. 
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We  would  like  to  generate  a  candidate  set  of  entries  (at  most  m),  guaranteed  to  contain  the 
efficient  (circled)  enunes  in  the  feasibility  matrix.  Since  it  is  computationally  expensive  to  deter¬ 
mine  the  values  of  entries  in  this  matrix,  it  is  desirable  to  evaluate  as  few  entries  as  possible.  It  is 
somewhat  surprising  that  we  can  in  fact  determine  the  candidate  set  (having  possibly  m  entries)  by 
evaluating  only  at  most  0(m)  of  the  entries  z-. 

For  notational  convenience,  it  will  be  assumed  that  m  =  2'^  for  some  integer  k  >  0.  Because 
of  the  ordering  properties  of  Z,  derived  from  the  ordering  of  x-weights  and  y-weights,  we  can 
perform  a  binary  search  on  the  elements  of  the  (median)  row  s  =  y  of  Z  to  find  the  smallest  t  so 
that  =  F.  This  will  be  termed  a  binary  search  at  level  0,  and  it  requires  CXlog  m)  evaluations  of 
entries  in  Z.  Entry  (s,  t)  of  the  mairix  then  becomes  a  candidate  solution.  The  ordering  properties 
of  Z  ensure  that  the  submatrix  of  Z  with  upper  left  comer  at  (s,  t)  contains  only  F’s,  while  the 
submatrix  with  lower  right  comer  at  (s,  t-1 )  contains  only  Ts.  Therefore,  it  is  only  necessary  to 
explore  two  remaining  submatrices  for  candidate  solutions:  the  submatrix  with  lower  left  comer  at 
(s-1,  t),  and  the  submatrix  with  upper  right  comer  at  (s+1,  t-1).  Each  submatrix  can  in  turn  be 
explored  by  performing  a  binary  search  along  a  row  in  median  position  within  the  submatrix.  If 
the  first  submatrix  has  Cj  columns  and  the  second  has  Cj  columns,  then  the  number  of  evaluations 
needed  to  carry  out  these  two  binary  searches  is  0(log  c,)-i-  0(log  C2).  By  the  concavity  of  the  log 
function,  y  log  c  j  +  ^  'og  ^  ~  y  »  number  of  evaluations 

required  for  these  two  level  1  searches  is  of  the  order  2  log-^ .  In  a  similar  way,  the  2'  binary 
Ncarches  at  level  I  require  at  most  2*  log  ^  evaluations,  so  the  total  number  of  evaluations  needed 
1',  of  the  order 

log  m  +  2(log  -j)  +4(log  -y)  -t-  ...  +  y  (2)  -t-  (-y)  1  ^  • 

i  =  l  ^ 

Since  the  infinite  series  is  convergent,  at  most  0(m)  entries  of  Z  will  need  to  be  examined.  A 
formal  statement  of  this  algorithm  (RECURSIVE  BISECTION)  is  given  below,  applied  to  an 
m  X  m  feasibility  matrix  Z. 
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RECURSIVE  BISECTION  ALCORITHM 


(Main  Program ( 

C  =  0; 

BISECT  (Z,  m,  m,  C); 

Select  (i.  j)  e  C  with  minimum  x(a,)  +  y(bj). 

[Recursive  Subprogram] 

procedure  BISECT  (M,  rows,  cols,  C) 

s  =  frows/ll; 

Determine  (by  binary  search)  the  smallest  t 
so  that  m^j  =  F;  C  =  C  u  {(s,  t)}; 
if  s  >  1  then 

Let  M,  be  the  submatrix  of  M  with  lower  left 
comer  at  (s-1,  t); 

BISECT  (Ml,  s-l,cols-t+l,C); 
if  s  <  rows  and  t  >  1  then 

Let  M2  be  the  submatrix  of  M  with  upper  right 
comer  at  (s-rl,  t-1); 

BISECT  (Ml,  rows-s,  t-i,  C). 

As  shown  above,  this  algorithm  requires  0(m)  entries  of  Z  to  be  evaluated  in  carrying  out 
the  various  binary  searches  on  partial  rows  of  the  input  matrix.  Once  the  smallest  feasible  entry  in 
the  prescribed  row  is  determined,  the  candidate  set  C  is  updated  appropriately.  We  now  discuss 
how  any  entry  of  the  feasibility  matrix  can  be  efficiently  computed,  when  required. 

Recall  that  the  value  of  an  entry  in  the  feasibility  matrix  is  determined  by  whether  there  is  a 
path  in  cardinality  p  from  0  to  n.  In  view  of  the  definition  (2)  of  edge  weights  and  the 

nonnegativity  of  processing  times,  the  following  monotonicity  property  is  immediate. 


Property  1.  If  i  <  j  <  k  then  x(i,  j).  x(j,  k)  <  x(i,  k)  and  y(i,  j),  y(j,  k)  <  y(i,  k). 


The  next  property  follows  directly  from  Property  1. 


Property  2.  Suppose  that  i  <  j  <  k  and  that  (i,  k)  e  Then  (i,  j)  €  and  (j.  k)  e  G^ 

Proof:  Since  (i,  k)  €  G^^  ^  we  have  x(i,  k)  <  x(a)  and  y(i,  k)  <  y(b).  From  Property  1,  x(i,  j)  < 
x(i.  k)  <  x(a)  and  y(i,  j)  <  y(i,  k)  <  y(b)  and  so  (i,  j)  g  G^jj.  In  a  similar  way,  (j,  k)  g 

We  are  now  in  a  position  to  describe  an  algorithm  for  determining  the  feasibility  of 
1  \(a).  >  (b)).  The  basic  idea  is  to  start  from  node  0  and  to  repeatedly  follow  the  edge  in  G^  that 
reaches  to  the  node  of  largest  index,  until  we  reach  node  n.  If  this  constructed  path  has  cardinality 
<  p  then  it  will  turn  out  that  (x(a),  y(b)]  is  feasible,  and  conversely.  This  greedy  procedure 
(REACH)  for  checking  feasibility  in  G^j^  is  stated  below. 

procedure  KEACH(a,b) 

1 .  s  -  0;  card  =  0; 

2.  While  card  <p 

Find  the  largest  j  so  that  (s,  j)  g  G^  j,; 

card  card  -  1 ; 

if  j  ^  n  then  output  true  and  stop 
else  s  =  j; 

?>.  Output /(//sc. 

The  validity  of  REACH  is  now  established:  namely,  it  will  be  shown  that  the  output  true  is 
produced  precisely  ss  lien  there  is  a  path  of  cardinality  p  in  G^  from  node  0  to  node  n.  Suppose 
that  RliACI  1  outputs  true.  Then  we  are  assured  of  a  path  P  of  cardinality  <  p  in  G^j,  from  0  to  n. 
Since  p  <  n.  Property  2  ensures  that  if  such  a  path  actually  has  cardinality  <  p,  then  there  must  be 
some  path  of  cardinality  exactly  p.  (Just  successively  replace  edges  (i,  j)  g  P  by  edges  (i,  k)  and 
(k.  j )  as  needed.)  On  the  other  hand,  suppose  there  is  a  path  Q  in  G^  ^  of  cardinality  p  from  0  to  n, 
where  Q  =  ((iQ,  i,),  (i,,  ii),  (ip.i-  ip)l.  with  i^  =  0  and  ip  =  n.  Let  the  path  produced  by 
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REACH  be  P  =  ((jo-Ji.).  O1.J2)'  ■  Om.  ji)l.  with  jg  =  0.  Clearly,  since  ij,- jg  and  (i^,  i,)e 
procedure  REACH  will  generate  j,  >  i,.  Since  (i,,  ij)  €  Property  2  guarantees  that 
(j,,  ij)  G  and  thus  the  node  selected  by  REACH  must  satisfy  jj  >  ij.  Continuing  in  this 
way,  we  can  establish  that  >  ij, . . jp  >  ip  =  n.  Hence  REACH  will  detect  a  path  of  cardinality 
at  most  p  to  node  n  and  output  true. 

The  time  complexity  of  REACH,  and  consequently  that  of  the  recursive  bisection 
algorithm,  is  now  easy  to  establish.  First,  we  note  that  the  graph  G^^jj  need  not  be  explicitly 
constructed  in  order  to  carry  out  Step  2  of  REACH.  Namely,  by  Property  1  the  edges  emanating 
from  node  s  are  automatically  ordered  by  their  x-weight  and  by  their  y-weight: 

x(s,  s+1]  <  x(s,  s+2)  <  ...  <  x(s,  n);  y(s,  s+1)  <  y(s,  s+2)  <  ...  <  y(s,  n). 

Thus,  a  binary  search  can  be  used  on  the  n-s  pairs  (s,  s+1), ...,  (s,  n)  to  find  the  largest  index) 
for  which  x(s,  j)  <  x(a)  and  y(s,  j)  <  y(b).  The  time  complexity  of  one  iteration  of  Step  2  is  thus 
0(log  n),  and  REACH  therefore  can  be  implemented  to  run  in  0(p  logn)  time.  As  a  result,  the 
recursive  bisection  algorithm  has  Ofp  m  logn)  =  0(p  n^  logn)  worst-case  time  complexity.  Also, 
the  initial  sorting  of  x-weights  and  y-weights  can  be  done  in  0(m  logm)  =  0(n^  logn)  time,  so  the 
overall  time  complexity  remains  at  0(p  n^  logn).  The  space  requirements  are  dominated  by  the 
0(n“)  storage  needed  to  keep  G  as  well  as  the  sorted  x-weights  and  v-veights. 

In  summary,  the  recursive  bisection  method  presented  in  this  section  exhibits  a  better 
worst-case  time  complexity  than  the  labeling  algorithm  of  Section  3.  Because  the  former  algorithm 
makes  repeated  use  of  binary  searches  (whose  worst-case  and  average-case  behaviors  are  similar). 
It  is  anticipated  that  in  practice  the  empirical  complexity  of  the  algorithm  should  closely  reflect  its 
worst-case  bound.  This  is  in  contrast  to  the  labeling  algorithm,  whose  empirical  performance  will 
depend  on  the  number  of  labels  actually  encountered. 

A  final  observation  concerns  the  additive  assumption  made  for  combining  processing  times 
into  edge  weights  xfi,  j)  and  y(i,  j).  The  specific  (additive)  way  of  combining  the  Xj^’s  and  the  yj^’s 
in  (2)  is  by  no  means  essential.  In  fact,  any  monotone  nondecreasing  function  of  the  processing 


times  can  be  used.  Then  Properties  1  and  2  will  continue  to  hold  and  the  resulting  algorithm  will 
still  be  valid. 
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5.  Complexity  Results 

In  Section  2  we  formulated  the  MLAA  problem  as  a  min-sum-max  optimization  problem  on 
a  graph.  For  the  case  when  exactly  two  stages  are  associated  with  each  module,  two  polynomial¬ 
time  algorithms  for  this  problem  were  presented  in  Sections  3  and  4.  In  the  present  section  we 
prove  that  for  r  stages  the  MLAA  decision  problem  is  NP-complete.  To  do  this  we  show  that  any 
3 -Satisfiability  decision  problem,  a  well  known  NP-complete  problem  (Garey  and  Johnson  1979), 
can  be  reduced  to  an  instance  of  the  MLAA  decision  problem.  Before  proceeding  with  the  proof 
we  state  the  aforementioned  problems  formally. 

3-SAT 

Instance:  A  collection  C  =  {C|,  C2. c^}  of  clauses  on  a  finite  set  U  =  {x^,  X2, ...,  x^)  of 
0, 1  variables  such  that  each  clause  Cj  is  a  disjunction  of  exactly  three  literals  (xj  or  Xj 
is  a  literal). 

Decision  Problem:  Is  there  a  truth  assignment  for  U  that  satisfies  all  the  clauses  in  C? 

MLAA 

Instance:  A  collection  of  n  modules  and  p  processors.  Associated  with  each  module  i  is  an  r 
dimensional  nonnegative  real  weight  vector  (wj(i),  W2(i),  ...,  w^(i)). 

Decision  Problem.  Is  there  a  partition  of  the  n  modules  into  p  contiguous  intervals  Ij, . . .,  Ip 
so  that  Xt- 1  niax  w  (I )  <  B,  where  B  is  a  positive  real  number?  (As  in  Section  2, 
w^(r)  =  X,f.  I  w^(i).) 

Theorem  1.  The  MLAA  decision  problem  is  NP-complete. 

Proof:  It  is  easy  to  see  that  MLAA  is  a  member  of  class  NP  since  a  nondeterministic  turing 
machine  need  only  guess  a  partition  1  j, . . .,  Ip  and  check  in  polynomial  time  whether  or  not 
Xk=i  max.  w^(I.)  <  B. 
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It  remains  to  show  that  any  instance  of  3-SAT  can  be  transformed  into  a  MLAA  problem. 
For  the  purposes  of  the  proof,  we  define  B  =  30u,  p  =  3u  +  4m,  n  =  4u  +  6m,  r  =  2u  +  1,  and 
F  =  24u.  Here  F  represents  a  parameter  whose  use  will  become  apparent  as  the  proof  proceeds. 
Throughout  the  proof  we  will  use  the  term  bottleneck  weight  to  refer  to  max.  w^(Ij)  relative  to 
some  entry  k.  TTte  transformation  consists  of  two  parts.  The  first  involves  constructing  four 
consecutive  modules  for  each  Xj  €  U;  the  second  appends  a  further  6m  modules  (described  later) 
corresponding  to  the  m  clauses.  Associate  with  each  of  the  first  4u  modules  a  weight  vector  of 
dimension  r.  The  first  2u  entries  of  each  weight  vector  will  be  positionally  associated  with  the 
elements  of  U  as  follows:  (xj,  Xj,  Xj,  X2>  Entry  2u  +  1  is  not  associated  with  a  literal 

and  will  contain  some  fractional  part  of  F.  This  entry  is  used  to  force  particular  assignments  of 
modules  to  processors.  In  particular,  the  values  chosen  for  F,  B,  and  p  will  force  a  bottleneck 
weight  of  F  in  entry  2u  +  1  for  any  feasible  partition  of  the  modules. 

Each  set  of  four  weight  vectors  associated  with  a  variable  Xj  are  assigned  values  in  a  similar 
fashion.  Module  j  s  1  (mod  4)  corresponds  to  the  start  of  a  new  set  of  four  modules  associated 
with  a  v  ariable.  Assume,  without  loss  of  generality,  that  j  s  1  (mod  4)  corresponds  to  the  i^ 
variable.  Each  entry  of  the  weight  vector  for  module)  will  be  0,  except  for  entry  2u  +  1  which  has 
a  value  of  F.  The  weight  vector  for  module  j  +  1  has  a  2  in  entry  2i  -  1,  an  F/2  in  entry  2u  +  1,  and 
O's  elsewhere.  The  weight  vector  for  module  j  +  2  has  a  2  in  entries  2i  -  1  and  2i,  an  F/2  in  entry 
2u  +  1,  and  O’s  elsewhere.  Lastly,  the  weight  vector  for  module  j  +  3  has  a  2  in  entry  2i,  an  F/2  in 
entry  2u  +  1,  and  O's  elsewhere. 

The  values  given  to  F,  B  and  p  insure  that  modules  j  =  1  (mod  4)  are  assigned  to  a  single 
processor  and  modules  j  +  1 ,  j  +  2,  and  j  +  3  are  assigned  to  exactly  two  processors.  In  particular 
if  modules  j  +  1  and  j  +  2  are  assigned  to  the  same  processor  then  bottleneck  weights  of  4  and  2 
result  for  the  components  associated  with  x^  and  Xj  respectively.  We  interpret  this  as  indicating  that 
Xj  is  true  since  4  is  greater  than  2.  The  alternative  assignment  indicates  Xj  is  true. 

The  second  part  of  the  transformation  concerns  clauses.  We  construct  a  sequence  of  6 
consecutive  modules  for  each  clause  in  an  instance  of  3-SAT.  Again  we  associate  a  weight  vector 
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of  length  r  with  each  of  the  6m  modules.  The  construction  of  these  weight  vectors  follows  a 
similar  pattern  for  each  of  the  m  six-module  sets.  Let  v  =  1  (mod  6)  and  j  =  4u  +  v.  The  weight 
vector  for  module  j  has  O’s  in  entries  1  to  2u  and  an  F  in  entry  2u  +  1.  The  weight  vector  for 
module  j  -h  1  has  a  2  for  the  entry  corresponding  to  the  first  literal  in  clause  c  =  [(v  -  1)/61  -t-  1,  a 
2F/3  in  entry  2u  -h  1 ,  and  O’s  elsewhere.  The  weight  vector  for  module  j  +  2  has  a  1  for  the  entries 
corresponding  to  the  first  two  literals  of  clause  c,  an  F/3  in  entry  2u  +  1 ,  and  O’s  elsewhere.  The 
weight  vector  for  module  j  +  3  has  a  1  for  the  entry  corresponding  to  the  second  literal  of  clause 
c,  an  F/3  in  entry  2u  +  I,  and  O’s  elsewhere.  The  weight  vector  for  module  j  +  4  has  a  1  for  the 
entries  corresponding  to  the  last  two  literals  of  clause  c,  an  F/3  in  entry  2u  +  1,  and  O’s  elsewhere. 
l>astly,  the  weight  vector  for  module  j  +  5  has  a  2  for  the  entry  corresponding  to  the  third  literal  of 
clause  c,  a  2F/3  in  entry  2u  +  1,  and  O’s  elsewhere. 

In  view  of  the  values  chosen  for  F,  B,  and  p,  each  module  j  =  4u  +  v,  with  v  =  1  (mod  6) 
must  be  assigned  to  a  single  processor  and  modules  j  +  1 , . . .,  j  -i-  5  must  be  partitioned  among 
exactly  three  processors.  Any  feasible  partition  of  modules  j  +  1, . . .,  j  +  5  into  three  intervals 
must  have  a  bottleneck  weight  of  no  more  than  F  in  entry  2u  +  1.  In  addition,  such  a  feasible 
partition  will  result  in  a  bottleneck  weight  of  3  for  at  least  one  (possibly  two)  of  the  entries 
(literals).  Any  literal  associated  with  a  bottleneck  weight  of  3  is  considered  as  one  satisfying  the 
clause.  Note  that  if  the  literal  was  assigned  the  value  true  earlier  in  the  variable  to  module  assign¬ 
ment.  that  entry  will  already  have  a  bottleneck  weight  of  4  which  will  mask  the  3.  However,  if  a 
literal  selected  by  the  clause  was  not  assigned  the  value  true  earlier,  that  entry’s  bottleneck  weight 
of  2  will  be  increased  by  the  value  3  created  in  the  modules  associated  with  the  clauses. 

This  transformation  of  an  instance  of  3-SAT  into  an  instance  of  MLAA  is  best  illustrated  by 
an  example.  Suppose  S  =  (xj  v  X2  v  X3)  a  (xj  v  X2  v  X3)  is  the  given  3-SAT  problem,  so  that 
Li  =  {Xj,  X2,  X3},  u  =  3,  m  =  2,  n  =  24,  B  =  90,  F  =  72,  and  p  =  17.  The  module  weight 
vectors  corresponding  to  the  three  elements  of  U  are  given  in  Figure  2  and  the  module  weight 
vectors  corresponding  to  the  clauses  are  given  in  Figure  3.  One  solution  for  S  has  Xj,  X2,  and  X3 
true.  A  module  to  processor  assignment  corresponding  to  this  solution  is  given  in  Figure  4.  The 
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bottleneck  vector  associated  with  the  solution  in  Figure  4  is  (4,  2,  4,  2,  2,  4,  F).  The  component¬ 
wise  sum  of  this  vec  tor  is  I-  *  IH  B,  as  expected. 

I  inally  we  must  show  that  a  given  instance  of  3-SAT  is  solved  if  and  only  if  (he  corres- 
[xrnding  instance  of  MLAA  has  a  solution  with  a  cost  of  no  more  than  B  =  30u.  Suppose  that  a 
given  instance  of  3-SAT  has  a  solution.  Our  prior  comments  guarantee  that  the  instance  of  MLAA 
constructed  from  the  given  instance  of  3-SAT  has  an  assignment  for  which  the  sum  of  the  xj  and  Xj 
bottleneck  weights  is  6  for  all  i  =  1, u  and  the  bottleneck  weight  in  the  last  entry  is  F.  There¬ 
fore  we  have  a  solution  with  total  weight  6u  +  F  =  B.  No  solution  to  the  constructed  instance  of 
MLAA  with  cost  less  than  B  is  possible  since  entry  2u  +  1  will  always  contribute  a  bottleneck 
weight  of  F  and  the  bottleneck  weights  corresponding  to  Xj  and  Xj  contribute  a  sum  no  less  than  6, 
for  a  total  weight  of  at  least  B.  It  follows  then  that  any  solution  to  the  instance  of  MLAA  will 
imply  a  variable  assignment  that  is  a  solution  for  the  corresponding  instance  of  3-SAT. 
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Figure  4.  A  solution  to  S  and  the  module  to  processor  assignment. 
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Since  we  can  solve  the  MLAA  decision  problem  by  finding  a  min-sum-max  path  with  p 
edges  in  an  acyclic  graph,  it  follows  that  the  min-sum-max  path  decision  problem  is  NP-complete 


for  acyclic  graphs.  We  state  this  as  a  corollary. 

Corollary  1.  The  Min-Sum-Max  decision  problem  for  acyclic  graphs  is  NP-complete. 

The  NP-completeness  of  the  MLAA  decision  problem  depends  on  treating  r  as  a  problem 
variable.  For  fixed  r  the  natural  extension  of  the  labeling  algorithm  in  Section  3  has  polynomial¬ 
time  complexity.  To  describe  this  extension,  the  only  change  needed  from  our  previous  descrip¬ 
tion  is  in  the  efficient  set  calculation.  The  following  discussion  centers  on  the  labeling  algorithm 
pseudocode  and  complexity  analysis  given  in  Section  3. 

Let  W(i,  j)  =  (W|(i,  j),  W2(i.  j). ...,  w^(i,j)j  be  the  weight  vector  associated  with  the  edge 
between  nodes  i  and  j.  Each  computation  of  the  form  L(i;k-1)  ®  W(i,  j)  performed  in  Steps  2  and 
3  requires  0(r  L^^^)  time;  constructing  the  list  Lj  =  L(i;k-1)  ®  W(i,  j)  requires  0(n  r  L^^^) 

time.  Notice  that  Lj  has  length  L  =  0(n  L^^^).  Kung  et  al.  (1975)  describe  an  0(L  log^'^  L )  time 
algorithm  (for  r  >  2 )  that  can  be  used  to  calculate  EFFICIENT(Lj).  Since  L^^,^  has  ©(m*^  *)  length, 
a  fact  easily  shown  by  induction,  this  extended  labeling  algorithm  has  0(p  n^*"  log*^’^  (n^^  *))  time 
complexity.  For  fixed  r,  MLAA  can  thus  be  solved  in  polynomial  time,  albeit  a  polynomial  of  high 
degree.  The  following  section  outlines  certain  refinements  for  improving  the  performance  of  this 
type  of  algorithm. 

6.  .Mgorithms  for  r  >  2 

The  recursive  bisection  algorithm  of  Section  4  can  also  be  generalized  for  r  >  2  stages: 
REACH  extends  in  a  natural  way,  and  the  feasibility  matrix  becomes  an  r-dimensional  array. 
However,  the  search  process  does  not  scale  well.  In  the  two-stage  problem,  knowledge  that  (x,  y) 
is  the  least  costly  feasible  solution  in  a  row  allows  us  to  ignore  at  least  one-half  of  the  rectangle 
being  searched  and  creates  two  smaller  rectangles  to  search  recursively.  In  an  r-stage  problem  we 
may  similarly  find  the  least  costly  feasible  solution  within  a  “row”  (vectors  that  differ  only  in  a 
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given  dimension).  However,  this  identification  allows  us  to  eliminate  as  little  as  l/Z'"’  of  the 
hyper-rectangle  being  searched  and  creates  up  to  2^-2  hyper-rectangles  that  must  be  searched 
recursively.  Consequently  the  remainder  of  this  section  focuses  on  extensions  of  the  labeling 
algorithm  for  r  >  2.  Specifically,  we  describe  computational  refinements  of  the  labeling  algorithm, 
together  with  two  heuristics  suitable  for  problems  that  are  too  large  for  exact  solution. 

We  have  tested  two  algorithms  for  determining  EFFICIENT(Lj);  the  algorithm  of  Kung  et 
al.  (1975)  and  an  algorithm  that  does  domination  checking  on  the  fly.  On  all  the  problems  tested, 
the  former  algorithm  ran  an  order  of  magnitude  slower.  Therefore,  the  less  sophisticated  (but  for 
our  problems,  apparently  faster)  second  approach  for  calculating  L(j;k)  has  been  subsequently 
used.  This  procedure  constructs,  then  accepts  or  rejects,  each  potential  member  of  L(j;k).  An 
accepted  member  may  cause  previously  accepted  members  to  be  discarded,  and  it  is  entered  only 
once  into  L(j;k).  A  potential  member  is  immediately  rejected  if  it  is  found  to  be  dominated  by  an 
existing  member  of  L(j;  k).  This  algorithm,  listed  below,  requires  0(n  r  LJna,)  time  to  construct 
L(j;k). 

CONSTRLCTJON  ALGORITHM 

1.  L(j;k)  =  0; 

2.  For  i  =  1,  ...,  j-1 

3.  For  all  U  e  L(i;k-1) 

T  =  U®  W(i,j); 

For  all  V  e  L0;k) 
if  T<  V 

L(j;k)  =  L(j;k)-  {V} 
else  if  T  >  V 
go  to  3; 

L(j;k)  =  L(j;k)vj  {T}. 

Let  T  e  L(j;  k)  and  suppose  there  is  a  path  vector  X  e  L(j;  h),  h  <  k,  such  that  X  <  T 
(componentwise  vector  comparison).  If  P  and  Q  are  subpaths  corresponding  to  path  vectors  T  and 
X  respectively,  then  in  any  solution  path  (with  p  edges)  that  uses  P,  the  subpath  P  can  be  replaced 
with  subpath  Q,  yielding  a  new  solution  path  (with  <  p  edges)  having  the  same  or  an  improved 
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solution  cost.  It  follows  from  Property  1  of  Section  4  that  such  a  path  can  be  extended  to  a 
solution  path  with  exactly  p  edges  without  degrading  the  solution  cost.  This  observation  leads  to 
our  first  pruning  rule  -  a  potential  member  T  of  L(j;k)  is  rejected  if  T  >  X  for  some  X  e  L(j;  h) 
with  h  <  k.  Use  of  this  rule  now  increases  the  worst-case  complexity  of  the  L(j;k)  calculation  to 
0(p  n  r  Lnjax).  Nevertheless,  in  empirical  testing  the  rule  was  found  to  eliminate  a  large  number  of 
paths,  and  as  a  result  improved  the  overall  execution  time. 

Further  pruning  is  possible  if  we  know  the  sum-max  cost  c  of  a  feasible  solution.  Let  T  be 
any  potential  member  of  L(j;  k),  corresponding  to  subpath  P.  The  component  sum  of  T  is  a  lower 
bound  on  the  cost  of  any  solution  path  using  P.  If  the  component  sum  of  T  exceeds  c,  we  know 
that  any  solution  involving  P  has  cost  exceeding  c  and  must  therefore  be  suboptimal.  The  vector  T 
is  then  eliminatul  by  this  forward  pruning  rule. 

Backward  pruning  is  also  possible.  For  any  fixed  j,  the  vectors  W(i,  j)  increase  monoton- 
ically  in  each  component  as  i  decreases.  Thus  if  the  sum  of  Wfig,  j^j’s  components  exceeds  c, 
then  for  all  i  <  i^  we  know  that  (i,  Jq)  cannot  be  an  edge  in  an  optimal  path.  For  every  j  we  define 
low(c,  j)  to  be  the  smallest  index  i  such  that  the  component  sum  of  W(i,  j)  is  less  than  c.  Then  in 
Step  2  of  the  construction  algorithm  above  the  lower  bound  of  the  i  loop  can  be  replaced  with 
low(c,  j).  Backward  pruning  also  allows  us  to  free  space  used  to  represent  path  vector  lists,  an 
important  consideration  in  extending  the  size  of  problems  that  can  be  solved  (see  Section  7). 

As  will  be  described  in  Section  7,  the  extended  labeling  algorithm  (with  the  above  pruning 
rules  incorporated)  has  been  successfully  used  on  a  number  of  nontrivial  examples;  in  one  case  we 
solved  a  1024  module,  four-processor,  four-stage  problem,  and  in  another  case  we  solved  a  128 
module,  sixteen-processor,  eight-stage  problem.  Some  problem  classes  (as  the  fluid  flow  problem 
mentioned  in  Section  I )  tend  to  have  a  substantial  number  of  modules,  thus  motivating  the  use  of 
heuristics  for  the  MLAA  problem.  Heuristics  are  also  useful  in  providing  feasible  solutions  for  the 
forward  and  backward  pruning  rules,  and  are  valuable  even  for  the  two-stage  problem  when  n  is 
so  large  that  the  0(p  n^  log  n)  complexity  is  daunting. 
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The  heuristics  to  be  developed  here  are  built  on  the  ability  to  solve  the  one-stage  problem 
very  efficiently,  in  0(p  n  log  n)  time.  This  efficient  technique,  based  on  the  methodology  devel¬ 
oped  in  Nicol  (1988),  is  now  briefly  described.  First  note  that  the  minimum  bottleneck  value  for 
the  one-stage  problem  is  identically  one  of  O(n^)  edge  weights.  A  one-stage  version  of  REACH 
establishes  the  feasibility  of  any  candidate  weight  in  0(p  log  n)  time.  The  set  of  all  edge  weights 
x(i,  j)  can  be  conceptually  organized  in  an  upper  triangular  matrix,  with  i  and  j  being  the  row  and 
column  indices,  respectively.  By  Property  1,  the  entries  in  any  row  increase  monotonically  in  the 
columns,  while  column  entries  decrease  monotonically  in  the  rows.  The  recursive  bisection 
algorithm  of  Section  4  is  then  easily  modified  to  find  the  minimum  feasible  edge  weight  using  0(n) 
REACH  calls,  showing  that  the  one-stage  MLAA  problem  can  be  solved  in  0(p  n  log  n)  time. 

Our  approach  to  developing  heuristics  proceeds  by  projecting  an  r-stage  problem  onto  a 
one-stage  problem,  which  is  then  optimally  solved.  An  optimal  partition  for  the  projected  problem 
yields  an  approximate  solution  for  the  r-stage  problem.  It  wUl  also  be  seen  that  the  ability  to  solve 
one-stage  problems  efficiently  aids  us  in  evaluating  the  quality  of  various  heuristic  solutions. 

Two  projections  arc  quite  natural.  The  first  takes  a  module’s  r-stage  weight  vector  and 
transforms  it  into  a  single-stage  weight  by  summing  all  the  weights.  The  second  method  trans¬ 
forms  a  module’s  weight  vector  by  selecting  the  maximum  component.  The  resulting  methods  arc 
termed  the  sum-projection  and  the  max-projection  heuristics,  respectively.  The  sum-projection 
heuristic  is  of  particular  interest  here.  As  will  be  seen  in  Section  7,  this  heuristic  produces  better 
partitions  in  practice  than  the  max -projection  heuristic.  Furthermore,  it  will  now  be  shown  how 
the  quality  of  a  partition  it  produces  can  be  bounded,  relative  to  the  (unknown)  optimal  partition. 
This  bound  is  particularly  tight  when  the  variance  in  module  weights  is  small. 

For  i  =  1, . . ..  p  let  Wj  be  the  sum  of  execution  times,  at  all  stages,  of  all  modules  assigned 
to  processor  i  under  the  partition  produced  by  the  sum-projection,  with  the  largest  among 
these.  Similarly,  let  Vj  be  the  sum  of  all  workloads  on  processor  i  under  the  optimal  partition  (with 
respect  to  the  r-stage  problem),  and  let  be  the  largest  of  these.  By  the  optimality  of  the  one- 
stage  partition  for  the  sum-projection,  Now  for  j  =  1, ...,  r  let  bj  be  the  bottleneck 
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weight  at  stage  j  for  the  sum-projection  partition,  and  let  sp{j)  denote  the  corresponding  stage  j 
bottleneck  processor.  Then  for  some  positive  Oj  <  1  we  have  bj  =  ttj  The  r-stage  cost  of 

the  sum-projection  partition  is  thus  C^p  =  ''^sp(j)’  y’®^ding 

r 

Csp  ^  X  “j 


Let  Cj,p(  be  the  cost  of  an  optimal  r-stage  partition.  Clearly  <  Cpp^,  because  represents 
the  work  that  must  be  done  by  one  processor  whereas  C^p,  is  the  total  system  finishing  time.  Thus 
ue  have  <  C„p,.  and  consequently 


W„a.'C.p  «  C„„/C,p.  (5 

Now  we  would  like  to  develop  bounds  for  the  left  hand  side  of  this  inequality. 

For  every  module  i,  let  and  m^'^  denote  the  respective  maximum  and  minimum  stage 
weights,  with  Pj  =  and  p*  =  min{pj;  i  =  1, ....  n}.  Now  Oj  is  the  fraction  of  processor 

sp(j)’s  total  workload  resident  at  stage  j,  and  ttj  is  consequently  maximized  if  all  of  the  stage  j 
module  weights  assigned  to  sp(j)  are  as  large  as  possible,  and  if  all  of  the  other  stage  weights  for 
sp(j)  are  as  small  as  possible.  Without  loss  of  generality  suppose  that  k  modules  are  assigned  to 
sp(j)  and  that  sp(j)  is  processor  one.  We  can  then  bound  Oj  from  above  using 


a  <  -- 
J  Jl 


^  ft) 

_ _ 


If  Sj,  . . ..  S|^  and  t, . t,j  are  nonnegative  reals  then  a  straightforward  induction  on  k  shows  that 


<  max 
“  l<i<k 
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Applying  the  above  to  inequality  (6)  produces 


a  <  max 
J  1  <  i  <k 


=  max  - = - 

i<i<k  (r-l)p|  +  1 


(r-l)p‘' 


This  allows  us  to  bound  C„  in  (4)  from  above,  giving 

ip 

C  <  **  ^max 

-  (r-l)p*  +  1 


(7) 


Another  bound  on  C^p  can  be  constructed  in  an  analogous  fashion.  Let  and  s^^  denote 

r  /j\ 

(respectively)  the  maximum  and  minimum  weights  overall  modules  at  stage  j,  with  a  =  s  . 
Then  we  have  otj  ^  "fj'  with 


T 

j 


This  yields  the  second  bound 

C.p<tW„3,,  (8) 

where  x  ,  x^  . 

Applying  the  above  bounds  (7),  (8)  on  C^p  to  (5)  produces  max{  1/x,  |(r-l)p*  +  IJ/r}  < 
^opi'^sp'  which  implies  in  particular  that 

max{  1  T,  p*.  1/r}  <Copj/C^p  .  (9) 

Relation  (9)  provides  in  turn  an  upper  bound  on  how  much  the  cost  of  the  heuristic  solution  varies 
from  the  cost  of  the  optimal  solution,  thus  indicating  the  quality  of  the  sum-projection  heuristic. 

The  bounds  above  are  especially  encouraging  when  module  weight  variation,  either  within 
a  module  or  within  a  stage,  is  known  to  be  low.  In  the  extreme  case  when  a  module’s  stage 
weights  are  all  equal  (so  p*  =  1),  or  when  all  module  weights  within  any  stage  are  equal  (so 
T  1 ).  relation  (9)  shows  that  Cj,p  <  and  thus  the  heuristic  will  in  fact  produce  an  optimal 
partition.  In  Section  7  we  will  encounter  a  problem  class  for  which  the  value  of  p*  is  no  smaller 
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than  1/2.  We  can  then  be  assured  that  the  sum-projection  solution  for  this  problem  class  is  never 
more  than  twice  as  costly  as  the  optimal  solution.  On  the  other  hand,  the  1/t  bound  is  useful  if  any 
module  weights  are  zero  (forcing  p*  =  0).  One  class  of  problems  studied  in  Section  7  frequently 
has  zero  module  weights. 

The  a  priori  bound  (9)  on  solution  quality  is  less  satisfying  when  1/x,  p*,  and  1/r  are  all 

small.  An  alternative  a  posteriori  approach  is  based  on  our  ability  to  quickly  construct  a  lower 

bound  on  C^p^.  Namely,  any  r-stage  problem  can  be  decomposed  by  stage  into  r  separate  one- 

stage  problems.  Let  be  the  optimal  bottleneck  value  of  the  one-stage  problem  constructed  from 

stage  j  module  weights,  and  let  opt,  be  the  stage  j  bottleneck  value  under  an  optimal  partition  for  the 

r 

r-stage  problem.  Clearly  we  must  have  a^  <  opt^  so  that  a.  <  C^p^.  Given  the  cost  Cj,  of  the 
partition  produced  by  some  heuristic,  the  ratio  C^^pj/C^is  bounded  from  below  by  the  ratio 
(i;.  I  /C^.  Each  a^  can  be  found  in  0(p  n  log  n)  time,  making  this  a  feasible  mechanism  for 
bounding  the  quality  of  any  partition  produced  by  a  heuristic.  Our  experience  with  both  exact  and 
heuristic  solution  methods  for  r  stages  is  summarized  in  the  following  section,  together  with 
computational  evidence  concerning  the  bounds  on  solution  quality. 

('omputational  Experience 

We  have  implemented  the  extended  labeling  algorithm,  as  well  as  the  two  projection 
heunstics.  in  the  C  language  on  a  networked,  diskless  SUN/3  workstation  having  4  Mb  of 
memory.  This  section  describes  the  computational  results  obtained  with  these  algorithms  on  three 
piMhlcm  types.  The  performance  benefits  gained  by  using  one  of  these  algorithms  as  a 
.la’p  IK  L  ^sor  for  a  multiprocessor  fluid  flow  code  are  also  discusser'. 

In  our  study  of  the  extended  labeling  algorithm,  the  number  of  modules  (n)  was  chosen  to 
be  a  power  of  2.  The  upper  limit  on  n  varied  by  problem  type,  with  the  largest  n  attempted  being 
1 024.  In  addition,  p  was  chosen  to  be  a  power  of  two  (between  2  and  16),  and  r  was  a  power  of 
two  (between  2  and  8).  Three  classes  of  test  problems  were  studied.  In  the  first  class,  all  module 
weights  were  randomly  chosen  from  the  uniform  probability  distribution  on  f  1,  10001].  In  the 
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second  class,  weights  for  any  given  stage  were  created  using  a  sine  function  over  (0,  27t|  with 
amplitude  50,  frequency  2,  and  translated  upward  in  the  y  direction  by  the  constant  150;  all  module 
weights  are  consequently  between  100  and  200.  To  generate  the  weight  for  module  i,  we  evaluate 
the  sine  function  at  x  =  i  -  27t/(n- 1)  and  retain  the  integer  portion  of  that  evaluation.  The  “phase 
constant”  for  such  a  curve  is  the  smallest  x  e  [0, 27t]  at  which  the  derivative  is  equal  to  1.  The 
curves  for  different  stages  have  different  phase  constants:  the  phase  constant  for  stage  1  is  0,  and 
for  stage  j  >  1  it  is  7t/2j  '.  Figure  5  illustrates  this  weight  generation  procedure.  The  third  class 
contains  problems  modeled  on  the  fluid  flow  example  described  in  Section  1.  For  this  class  we 
used  at  most  four  stages;  these  problems  frequently  have  module  weights  of  zero  in  regions  where 
finer  grids  are  absent. 


/.  z ... 


-/ 
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Figure  5.  Module  weights  in  a  four-stage  sine  wave  problem. 
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This  computational  study  was  devised  to  (i)  determine  how  lai^e  a  problem  can  be  solved 
exactly,  (ii)  determine  whether  the  refinements  proposed  improve  our  ability  to  solve  larger  prob¬ 
lems,  and  (iii)  assess  the  quality  of  solutions  produced  by  the  heuristics,  when  compared  to  the 
exact  solution  obtained  by  the  labeling  algorithm.  In  this  empirical  study,  problems  from  a  given 
class  with  identical  size  parameters  behaved  similarly.  Consequently,  we  feel  justified  in  reporting 
the  results  from  a  single  (representative)  run.  In  the  tabulations  that  follow,  two  problems  differ¬ 
ing  only  in  the  number  of  processors  had  the  same  module  weights. 

Table  2  presents  a  summary  of  results  obtained  using  the  extended  labeling  algorithm 
(utilizing  all  refinements  described  in  Section  6).  Results  for  each  problem  class,  indexed  by  the 
number  of  modules  (n)  and  the  number  of  processors  (p),  display  three  performance  measures. 

The  first  measure  is  elapsed  processing  time  for  the  problem,  in  seconds.  This  time  excludes  that 
required  to  generate  a  feasible  solution  using  the  sum-projection  heuristic,  as  this  amount  was 
insignificant  in  comparison  to  the  execution  time  required  by  the  extended  labeling  algorithm.  If 
the  running  time  has  an  associated  a.sterisk,  that  problem  has  four  stages;  otherwise  it  has  eight 
stages.  The  second  measure  is  the  speed-up  factor  from  using  the  refinements:  namely,  the  ratio  of 
processing  time  without  refinements  (the  basic  algorithm)  to  that  utilizing  the  refinements.  The 
third  measure  is  the  ratio  of  the  maximum  number  of  path  vectors  in  use  at  any  time  while  running 
the  basic  version,  di\  ided  by  the  same  quantity  for  the  version  with  refinements.  The  maximum 
number  of  path  vectors  used  is  an  important  measure,  as  it  quantifies  the  amount  of  storage  needed 
b>'  an  algorithm.  Indeed,  for  the  system  configuration  used  in  conducting  the  tests,  the  limiting 
factor  in  solving  the  largest  problems  turned  out  to  be  the  available  storage  rather  than  the  compu- 
ta  .  :i  ii  iie.  The  symbol  (-)  is  used  in  Table  2  to  indicate  when  the  given  problem  could  not  be 
solved  by  the  basic  algorithm  with  the  available  resources. 

Table  2  shows  that  for  a  variety  of  problem  types  the  extended  labeling  algorithm  can  solve 
problems  of  reasonable  size  in  a  reasonable  amount  of  time.  These  results  also  show  that  the 
refinements  proposed  can  significantly  reduce  the  algorithm’s  execution  time  and  storage  require¬ 
ments,  and  in  doing  so  increase  the  size  of  problems  that  can  be  solved  exactly. 
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We  are  also  interested  in  the  quality  of  solutions  produced  by  the  proposed  heuristics,  and 
in  the  behavior  of  the  a  priori  and  a  posteriori  bounds  on  C^pj/C^p  developed  in  Section  6.  We 
have  applied  the  sum-projection  and  max-projection  heuristics  to  each  problem  listed  in  Table  2  and 
have  computed  the  ratio  of  the  optimal  cost  divided  by  the  heuristic  cost.  We  have  also  computed 
the  a  priori  and  a  posteriori  bounds  for  the  sum-projection  heuristic.  Table  3  displays  the  sample 
mean  and  standard  deviation  of  the  individual  problem  measurements  for  each  problem  class.  For 
the  problems  studied  here,  the  sum-projection  always  achieved  a  better  solution  than  the  max- 
projection,  and  it  typically  obtained  solutions  quite  close  to  optimal.  The  sum-projection’s  a  priori 
bounds  tend  to  be  low,  due  to  significant  variation  in  the  module  weights.  On  the  other  hand,  the  a 
posteriori  bounds  are  reasonably  tight,  thereby  justifying  the  additional  effort  needed  to  compute 
them. 

The  sum-projection  heuristic  has  been  used  to  produce  partitions  for  a  multiprocessor 
implementation  of  the  fluid  flow  problem.  The  multiprocessor  employed  is  a  Flex/32,  at  the 
.NASA  Langley  Research  Center.  The  Flex/32  is  a  shared-memory  architecture,  with  eighteen 
processors  available  for  parallel  processing.  The  grid  problems  studied  consisted  of  2048  coarse 
grid  points  and  four  stages;  each  problem  was  partitioned  for  sixteen  processors.  The  flow 
problems  studied  tended  to  concentrate  fine  grids  at  one  end  of  the  domain.  We  compared  the 
execution  time  of  the  sum-projection  method  against  that  of  the  simplest  partitioning  method, 
which  simply  divides  the  domain  into  sixteen  equi-length  pieces.  On  a  suite  of  five  such  problems, 
computations  partitioned  using  the  sum-projection  heuristic  ran  more  than  twice  as  fast  compared 
to  tlie  equi-length  partition.  Data  taken  from  these  runs  is  presented  in  Table  4.  This  provides 
clear  evidence  that  substantial  pjerformance  benefits  can  be  gained  by  using  the  MLAA  solution 
techniques  on  real  problems,  implemented  on  real  parallel  architectures. 
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Table  2 

Performance  of  the  Extended  Labeling  Algorithm  on  Three  Problem  Classes 
(Four-stage  problems  are  marked  with  *,  all  others  involve  eight  stages) 


Uniform  Random  Problems 


p  =  4 

p=  8 

p  =  16 

time 

time 

space 

time 

time 

space 

time 

time 

space 

n 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

32 

3 

3 

3.5 

8 

6 

6.1 

156 

— 

— 

64 

14 

1 

3.6 

99 

6 

5.9 

286 

- 

- 

128 

10 

3 

3.4 

2147 

- 

- 

409* 

- 

- 

256 

24* 

2 

3.1 

1139* 

- 

- 

Sine  Wave  Problems 


II 

D. 

p  =  8 

p  =  16 

time 

time 

space 

time 

time 

space 

time 

time 

space 

n 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

32 

6 

1 

2.6 

6 

2 

4.5 

7 

2 

7.3 

64 

8 

2 

2.7 

16 

5.3 

8.2 

62 

13 

9.2 

128 

15 

3 

2.9 

406 

- 

- 

2960 

- 

- 

256 

190* 

- 

- 
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Fluid  Flow  Grid  Problems 


p  =  4 

p=8 

p  =  16 

time 

time 

space 

time 

time 

space 

time 

time 

space 

n 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

32 

1* 

3 

1.8 

1* 

3 

3.0 

1* 

4 

3.6 

64 

4* 

1 

2.2 

4* 

2 

4.0 

4* 

4 

6.0 

128 

5* 

4 

3.3 

17* 

3 

5.4 

33* 

6 

5.3 

256 

21* 

3 

2.8 

191* 

- 

- 

232* 

- 

- 

512 

73* 

35 

3.1 

5534* 

- 

- 

1024 

4524* 

- 

- 

Table  3 

Measurements  and  Lower  Bounds  on  Ratio  of  Optimal  Cost  Divided  by  Heuristic  Cost 


Uniform 

Sine 

Grid 

ratio 

mean  deviation 

mean  deviation 

mean  deviation 

max 

sum 

a  priori 

a  posteriori 

0.92  0.04 

0.97  0.02 

0.30  0.07 

0.87  0.05 

0.94  0.20 

0.98  0.01 

0.68  0.05 

0.90  0.03 

0.79  0.11 

0.85  0.08 

0.26  0.02 

0.70  0.15 

Table  4 

Fluid  Flow  Problem  Execution  Times  (in  seconds) 
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run 

serial 

time 

MLAA 

time 

MLAA 

speedup 

standard 

time 

standard 

speedup 

1 

149.9 

14.9 

10.0 

32.8 

4.5 

n 

142.1 

14.3 

9.9 

30.8 

4.6 

3 

130.2 

13.4 

9.7 

30.0 

4.3 

4 

110.2 

12.5 

8.8 

28.4 

3.8 

5 

101.6 

11.7 

8.7 

22.9 

4.4 

8.  Summary 

This  paper  has  considered  the  problem  of  optimally  assigning  modules  to  processors,  so 
that  the  sum  of  maximum  workloads  at  each  stage  is  minimized.  After  formulation  of  this  problem 
as  a  multi-objective  network  optimization  problem,  two  polynomially-bounded  algorithms  were 
developed  for  the  case  of  r  =  2  stages.  The  general  problem  (for  arbitrary  r)  was  demonstrated  to 
be  NP-hard.  In  order  to  solve  large-scale  problems,  two  heuristic  procedures  were  developed 
based  on  projecting  an  r-stage  problem  onto  an  efficiently  solvable  one-stage  problem.  Bounds  on 
the  solution  quality  produced  by  the  heuristic  methods  were  obtained  using  a  pnori  and  a  posteriori 
information.  Computational  experience  was  presented  with  one  of  the  exact  algorithms,  incorpor¬ 
ating  certain  pruning  rules  and  making  use  of  a  feasible  solution  produced  by  the  sum-projection 
heuristic.  The  empirical  results  demonstrate  that  the  exact  method,  suitably  refined,  can  solve 
some  reasonably  large  problems.  For  larger  problems  the  sum-projection  heuristic  procedure 
shows  a  good  deal  of  promise,  exemplified  by  its  success  in  partitioning  processors  for  an  actual 
Huid  flow  problem. 
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